Creating a New Synthetic Dataset

The Synthetic Data feature enables users to generate customized datasets using advanced models such as GPT-4. This process involves selecting input data from files or databases, configuring data generation settings, and applying filters and output formats.

This document outlines the steps involved in creating a synthetic dataset, from selecting the model to finalizing the dataset configuration.

Steps to Create a New Synthetic Dataset

1. Select Model

In the first step, users provide basic details about the synthetic dataset:

  • Dataset Name: Enter a name for your dataset.
  • Model Type: Choose the LLM (e.g., GPT-4) you wish to use for data generation.
  • Prompt: Optionally, provide a prompt to guide the model in generating data based on your requirements.

Once these details are filled in, click Next to proceed to the configuration options.
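As a rough illustration, the three fields collected in this step can be modeled as a small record with basic validation. This is only a sketch: the class name `DatasetDetails`, its field names, and the example dataset name are hypothetical and do not correspond to a documented product API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetDetails:
    """Hypothetical container for the step 1 fields (not a product API)."""
    dataset_name: str
    model_type: str
    prompt: Optional[str] = None  # the prompt is optional in this step

    def validate(self) -> None:
        # Only the name and model are checked, since the prompt may be omitted.
        if not self.dataset_name.strip():
            raise ValueError("dataset_name must not be empty")
        if not self.model_type.strip():
            raise ValueError("model_type must not be empty")


# Example values are placeholders chosen for illustration.
details = DatasetDetails(dataset_name="support-tickets-synthetic", model_type="GPT-4")
details.validate()
```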


2. Configure Input Source

The second step involves selecting and configuring the input source for data generation. You can choose between two options: File or Database.

File Input Configuration

For file-based input, the following options are available:

  • File Type: Choose JSON or text as the input format, or upload a new file.
  • Upload New File: Drag and drop a file into the input area, or select a file from your device.
  • Sample Size: Specify the number of data points to generate.
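  • Temperature: Adjust the randomness of the model's sampling. A higher temperature results in more diverse and creative outputs; a lower temperature results in more predictable ones.
  • Top P: Control the diversity of the generated content by limiting sampling to the smallest set of most likely tokens whose combined probability reaches P. Lower values restrict the model to its most likely choices.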
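To make the effect of Temperature and Top P concrete, the sketch below applies both to a toy set of three token scores. It follows the way these settings are commonly defined for LLM sampling; how the product applies them internally is not described here, so the function names and the example scores are assumptions for illustration only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0]  # toy scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: top token dominates
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter distribution

# With top_p=0.9 the least likely of the three tokens is dropped.
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)
```

In this toy case, lowering the temperature concentrates probability on the highest-scoring token, while lowering Top P removes low-probability tokens from consideration entirely.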

Database Input Configuration

For database input, provide the connection details and generation settings:

  • Database Name and Host: Provide the database name and host details.
  • Sample Size: Set the number of data points to generate.
  • Temperature and Top P: These options allow you to control the creativity and diversity of the generated data in the same way as with file input.

Once the input source is configured, click Next to move on to the column and filter selection.


3. Column Selection

In this step, you’ll specify which columns from the input data should be included in the generated synthetic dataset:

  • Select Table: Choose the relevant table when using database input.
  • Select Columns: Select the specific columns you want to include in the final synthetic dataset. You can choose all columns or select only the ones that are relevant to your needs.
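Conceptually, column selection is a projection over the input rows. The following minimal sketch assumes the input has already been loaded as a list of dictionaries; the `select_columns` helper and the sample column names are hypothetical and only illustrate the idea.

```python
def select_columns(rows, columns=None):
    """Return rows restricted to `columns`; passing None keeps every column."""
    if columns is None:
        return [dict(row) for row in rows]
    # Check requested names against the first row to catch typos early.
    missing = [c for c in columns if rows and c not in rows[0]]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return [{c: row[c] for c in columns} for row in rows]


# Sample rows with made-up columns, standing in for a file or table.
rows = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Lin", "email": "lin@example.com"},
]

subset = select_columns(rows, ["id", "name"])  # only the relevant columns
everything = select_columns(rows)              # all columns
```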

4. Filter Selection

In the filter selection step, users can further refine the dataset:

  • Choose Filter: Apply filters based on specific conditions such as column values, ranges, or limits.
  • Custom Filters: Define custom conditions to further refine the dataset. You can add multiple filters and specify conditions like equal to (=), greater than (>), less than (<), and more.
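The filter conditions above can be pictured as (column, operator, value) triples evaluated against each row. The sketch below is illustrative only: the `apply_filters` helper and sample data are hypothetical, and combining multiple filters with a logical AND is an assumption, since the combination rule is not stated here.

```python
import operator

# Map the comparison symbols mentioned above to Python functions.
OPERATORS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def apply_filters(rows, filters):
    """Keep rows that satisfy every (column, op, value) condition.

    Assumption: multiple filters are combined with AND.
    """
    result = []
    for row in rows:
        if all(OPERATORS[op](row[col], value) for col, op, value in filters):
            result.append(row)
    return result


# Made-up sample rows for illustration.
rows = [
    {"id": 1, "country": "DE", "age": 34},
    {"id": 2, "country": "DE", "age": 19},
    {"id": 3, "country": "FR", "age": 41},
]

# Two filters: country equals "DE" and age greater than 21.
filtered = apply_filters(rows, [("country", "=", "DE"), ("age", ">", 21)])
```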

Once the filters are applied, click Next to review the final dataset summary.